VIS(US) Stuttgart – SpRay / See Linq

VAST 2009 Challenge
Challenge 1: Badge and Network Traffic

Authors and Affiliations:

Julian Heinrich, VISUS - Uni Stuttgart, Julian.Heinrich@vis.uni-stuttgart.de
Christoph Mueller, VISUS - Uni Stuttgart, Christoph.Mueller@vis.uni-stuttgart.de
Guido Reina, VISUS - Uni Stuttgart, Guido.Reina@vis.uni-stuttgart.de

Tool(s):

SpRay was developed as part of Julian Heinrich's master's thesis at the Eberhard-Karls-Universität Tübingen and was originally targeted at the visual exploration of gene expression data. SpRay is a generic visual analytics tool that tightly integrates interactive visualization with the statistical programming language R.

See Linq was developed during the VAST ’09 contest by Guido Reina and Christoph Müller at the Visualization Institute of Universität Stuttgart (VISUS). It is based on a queryable model and employs .NET mechanisms to integrate interactively formulated queries as data sources for linking and brushing. The visualization was developed using rapid prototyping and supports time-based events. Although the glyphs are customized for this contest, the visualization can easily be adapted to other tasks.

 

Video:

 

Video

 

 

ANSWERS:


MC1.1: Identify which computer(s) the employee most likely used to send information to his contact in a tab-delimited table which contains for each computer identified: when the information was sent, how much information was sent and where that information was sent.

Traffic.txt

 


MC1.2:  Characterize the patterns of behavior of suspicious computer use.

To identify suspicious computer use, we first tried to identify irregular network traffic and then related it to the badge log to single out the potential mole.

 

Using SpRay, we visualized the source/destination IP connection matrix as tables and parallel coordinates. The visualization reveals one outlier in connection count, 37.170.30.250, and one destination whose count is extremely regular, 37.170.100.200 (Figure 1). Inspecting the traffic to the outlier in a table shows that only port 25 is used. Selecting this port in the parallel coordinates plot confirms that all mail traffic is in fact directed at 37.170.30.250. We hypothesized that data theft via mail is too risky (mail is usually logged) and therefore excluded all mail traffic. Traffic to 37.170.100.200 is generated in equal measure by all employee machines, and its count in the linked table approximately matches the number of working days in a month (20/21), so it is probably not suspicious. Using the total upload size in the connection matrix reveals another outlier in parallel coordinates: the top 13 uploads go to 100.59.151.133, which we therefore classify as suspicious.
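The aggregation behind this step can be sketched in a few lines of Python. This is only an illustration, not the SpRay/R implementation: the record fields (`source`, `dest`, `port`, `req_size`) and all sample values are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical record layout; the values below are made up for illustration.
SAMPLE = [
    {"source": "37.170.100.5", "dest": "37.170.30.250", "port": 25, "req_size": 4000},
    {"source": "37.170.100.5", "dest": "37.170.100.200", "port": 80, "req_size": 300},
    {"source": "37.170.100.7", "dest": "37.170.100.200", "port": 80, "req_size": 310},
    {"source": "37.170.100.7", "dest": "100.59.151.133", "port": 8080, "req_size": 9_000_000},
]


def summarize_destinations(records, excluded_ports=(25,)):
    """Return {dest_ip: (connection_count, total_request_bytes)}, mail excluded."""
    counts = defaultdict(int)
    upload = defaultdict(int)
    for rec in records:
        if rec["port"] in excluded_ports:
            continue  # mail traffic is assumed to be logged and is skipped
        counts[rec["dest"]] += 1
        upload[rec["dest"]] += rec["req_size"]
    return {dest: (counts[dest], upload[dest]) for dest in counts}


def largest_uploads(records, n=13, excluded_ports=(25,)):
    """Return the n records with the largest request size, mail excluded."""
    kept = [r for r in records if r["port"] not in excluded_ports]
    return sorted(kept, key=lambda r: r["req_size"], reverse=True)[:n]


SUMMARY = summarize_destinations(SAMPLE)
TOP = largest_uploads(SAMPLE, n=1)
```

Sorting `SUMMARY` by either tuple component gives the two views used above: connection count per destination and total upload size per destination.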

 

Figure 1: Parallel coordinate plot linked with R backend, displaying the connection count per destination address

 

To find a relation to the badge traffic, we devised a visualization that presents IP traffic in context with badge events (see Figure 2). Since our data model checks for basic consistency, such as a strict alternation of prox-in-classified and prox-out-classified events, we found badges 30, 38, and 49 to be inconsistent. Badges 38 and 49 show multiple simultaneous presences in the classified room (missing prox-out-classified events), while badge 30 shows a negative presence (missing prox-in-classified events). We corrected this programmatically by inserting each missing prox-out-classified just before the next prox-in-building and each missing prox-in-classified just after the previous prox-in-building. These virtual events are drawn in red so that any seemingly suspicious traffic resulting from them can be discounted.

 

Figure 2: Badge and IP traffic visualized together, X-axis represents time from left to right, Y-axis represents employees

 

Figure 3: Top left: dataset with the unmatched prox-in and prox-out-classified events caused by e.g. piggybacking. Bottom right: after programmatic insertion of virtual events, the classified presence of employees #30, #38, and #49 can be determined, but false positives for network traffic on machine #30 appear (empty circles).
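The insertion rule for virtual events might look roughly like the following Python sketch. It is not the See Linq code: the tuple layout, the one-second offset `EPS`, and the sample timestamps are assumptions for illustration. Cases the rule in the text does not cover (for example two prox-in-classified events with no prox-in-building between them) are left untouched here.

```python
from datetime import datetime, timedelta

EPS = timedelta(seconds=1)  # assumed offset for "just before" / "just after"


def repair_badge_events(events):
    """Insert virtual classified-room events for ONE badge.

    events: list of (time, kind) sorted by time.
    Returns a time-sorted list of (time, kind, is_virtual).
    """
    out = []
    inside = False           # currently inside the classified room?
    last_building_in = None  # time of the most recent prox-in-building
    for t, kind in events:
        if kind == "prox-in-building":
            if inside:
                # Missing prox-out-classified: close the stay just before re-entry.
                out.append((t - EPS, "prox-out-classified", True))
                inside = False
            last_building_in = t
        elif kind == "prox-in-classified":
            inside = True
        elif kind == "prox-out-classified":
            if not inside and last_building_in is not None:
                # Missing prox-in-classified: open the stay just after entry.
                out.append((last_building_in + EPS, "prox-in-classified", True))
            inside = False
        out.append((t, kind, False))
    out.sort(key=lambda e: e[0])
    return out


# Made-up timestamps, for illustration only.
DAY = datetime(2008, 1, 2, 8, 0, 0)
MISSING_IN = repair_badge_events([
    (DAY, "prox-in-building"),
    (DAY + timedelta(hours=2), "prox-out-classified"),
])
MISSING_OUT = repair_badge_events([
    (DAY, "prox-in-building"),
    (DAY + timedelta(hours=1), "prox-in-classified"),
    (DAY + timedelta(hours=24), "prox-in-building"),
])
```

Keeping the `is_virtual` flag on every event is what allows the inserted events to be drawn differently from the recorded ones.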

 

Highlighting the traffic to 37.170.100.200 shows that it is caused either by some kind of login process or by a bulletin system: whenever an employee accesses it, this happens once a day, as the first traffic from his machine. We verified this by formulating an exact query against the underlying data model. If no access to this machine occurs, the employee enters the classified room before generating any IP traffic, so the necessary information must be available there as well. We verified the 21 occurrences manually in the visualization. Only three exceptions remain, and they are uploads to the already suspicious 100.59.151.133.
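A query of this kind can be approximated as follows. The sketch is in Python rather than the .NET query mechanism the tool uses, and the record fields and sample values are hypothetical.

```python
from datetime import datetime

LOGIN_HOST = "37.170.100.200"


def first_destination_per_day(records):
    """Return {(source_ip, date): dest_ip of the earliest record on that day}."""
    first = {}
    for rec in records:
        key = (rec["source"], rec["time"].date())
        if key not in first or rec["time"] < first[key]["time"]:
            first[key] = rec
    return {key: rec["dest"] for key, rec in first.items()}


def first_traffic_exceptions(records, expected=LOGIN_HOST):
    """List (source_ip, date) pairs whose first traffic does not go to `expected`."""
    firsts = first_destination_per_day(records)
    return sorted(key for key, dest in firsts.items() if dest != expected)


# Made-up sample records, for illustration only.
SAMPLE = [
    {"source": "37.170.100.5", "dest": LOGIN_HOST, "time": datetime(2008, 1, 8, 8, 0)},
    {"source": "37.170.100.5", "dest": "10.0.0.1", "time": datetime(2008, 1, 8, 9, 0)},
    {"source": "37.170.100.7", "dest": "100.59.151.133", "time": datetime(2008, 1, 8, 8, 30)},
    {"source": "37.170.100.7", "dest": LOGIN_HOST, "time": datetime(2008, 1, 8, 9, 0)},
]
EXCEPTIONS = first_traffic_exceptions(SAMPLE)
```

Each pair in `EXCEPTIONS` is a machine and day that would then need manual inspection against the badge log.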

 

Highlighting the remaining 15 uploads to 100.59.151.133, one can see in the visualization that some of them happen while the computer owner is in the classified room. These uploads are also conspicuously isolated. Hence we assumed that the computer owner probably did not trigger the uploads. We supposed that the mole, to obfuscate his actions, does not necessarily use his own computer to upload the stolen data. We wanted to find out which employee has enough time to trigger these suspicious uploads. We defined a variable time margin before and after each upload during which the suspect is not allowed to be in the classified room. Applying this filter programmatically with a margin of 2 minutes before and after each upload, only employees #27 and #30 remain. We manually examined the uploads for both suspects and found that on 01/22, #27 entered the building after the upload and did not exhibit any other network traffic before badging in. Therefore #27 is not the suspect.
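The time-margin filter can be expressed compactly. The following Python sketch assumes that classified-room stays are already available as (start, end) intervals per employee; the data layout and the sample values are hypothetical.

```python
from datetime import datetime, timedelta


def possible_uploaders(classified_stays, upload_times, margin=timedelta(minutes=2)):
    """Return employees who are never in the classified room near any upload.

    classified_stays: {employee_id: [(start, end), ...]}
    upload_times: list of datetimes of the suspicious uploads
    An employee is ruled out if any stay overlaps [upload - margin, upload + margin].
    """
    remaining = []
    for employee, stays in classified_stays.items():
        clear = all(
            end < upload - margin or start > upload + margin
            for upload in upload_times
            for start, end in stays
        )
        if clear:
            remaining.append(employee)
    return sorted(remaining)


# Made-up sample data, for illustration only.
UPLOAD = datetime(2008, 1, 8, 11, 0)
STAYS = {
    12: [(datetime(2008, 1, 8, 10, 30), datetime(2008, 1, 8, 10, 59))],  # leaves 1 min before
    15: [(datetime(2008, 1, 8, 10, 50), datetime(2008, 1, 8, 11, 10))],  # inside during upload
    27: [(datetime(2008, 1, 8, 10, 0), datetime(2008, 1, 8, 10, 30))],
    30: [(datetime(2008, 1, 8, 12, 0), datetime(2008, 1, 8, 12, 30))],
}
CANDIDATES = possible_uploaders(STAYS, [UPLOAD])
```

Widening `margin` shrinks the candidate set, so the parameter trades false positives against the risk of excluding the actual mole.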

 

Figure 4: Suspicious uploads to 100.59.151.133 along with inconsistent traffic (circled). Inconsistencies for #30 stem from adjusted data. Only the two potential moles, #27 and #30, are never in the classified room during these uploads. The red box shows a magnification of the incident when the suspicious upload on 01/22 happens before #27 enters the embassy.

 

For each upload event, we then checked whether the suspicious upload was carried out while neither the computer owner nor his or her roommate was present. In most cases, the people who could have disturbed the mole are inside the classified room, or their machines generate no traffic for a significant time, which we interpreted as absence. When #30 uses his neighbor's machine (37.170.100.31), his own machine shows traffic regardless, which supports our suspicion. On 01/24 there are two very risky uploads that leave very little time (10 minutes and 3 minutes) to complete the upload without being surprised by the machine owner or his neighbor. This might be a consequence of the mole's increasing talkativeness towards the end of the month: as the amount of data transmitted per day increases by a factor of 3 between 01/08 and 01/31, more uploads from different machines are required to send everything. However, all rooms are across the aisle from his own, and we hypothesized an at least semi-automatic upload process that minimizes the mole's time at another machine (because of contract termination).

 

Figure 5: Spatial distribution of the machines used to transmit data. Most of the machines are within easy reach from the desk of employee #30.

 

Our conclusion is that employee #30 is the mole who uploads classified information 18 times to 100.59.151.133 using 12 different machines.

 

We derive the following patterns from our analysis:

- Mail traffic is too easy to track and too risky for leaking confidential data

- Data theft is defined by large RequestSize in the IP traffic

- Information is always sent to the same destination

- Information is always sent on Tuesday and Thursday.

- The mole never uses his own machine for uploads

- The mole can only hijack machines when alone (termination)
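The weekday pattern is easy to check for the upload dates mentioned in this answer. The sketch below assumes the log year is 2008, which the text itself does not state; only the four dates named above are used.

```python
from datetime import date

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")
ASSUMED_YEAR = 2008  # assumption: the year is not given in the text


def weekday_name(month, day, year=ASSUMED_YEAR):
    """Return the English weekday name for a month/day in the assumed year."""
    return WEEKDAYS[date(year, month, day).weekday()]


# Dates mentioned in the analysis above (month, day).
MENTIONED = [(1, 8), (1, 22), (1, 24), (1, 31)]
DAY_NAMES = {f"{m:02d}/{d:02d}": weekday_name(m, d) for m, d in MENTIONED}
```

Under that assumption, the four dates fall on Tuesdays and Thursdays only, which is consistent with the pattern listed above.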

 

We also tried alternative approaches to find additional suspicious uploads with programmatic queries:

- The obvious pattern of an employee leaving the classified room and generating an upload within a certain time t. Setting t to 10 minutes revealed no significant clusters of source or destination. The largest uploads are either isolated or go to machines that are commonly accessed from at least half of the employee computers.

- We also ranked traffic by request/response ratio instead of absolute request size. Here, too, the top ten entries consisted only of commonly accessed machines and the already known 100.59.151.133, so this approach merely confirmed our primary suspect.
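Both alternative queries can be sketched briefly in Python. The record fields (`source`, `dest`, `time`, `req_size`, `resp_size`), the mapping from machine to the owner's leave times, and the sample values are assumptions for illustration, not the actual query code.

```python
from datetime import datetime, timedelta


def uploads_after_leaving(leave_times, records, t=timedelta(minutes=10)):
    """Return records sent within t after the machine owner left the classified room.

    leave_times: {source_ip: [datetime of the owner's prox-out-classified, ...]}
    """
    hits = []
    for rec in records:
        for left in leave_times.get(rec["source"], []):
            if timedelta(0) <= rec["time"] - left <= t:
                hits.append(rec)
                break  # count each record at most once
    return hits


def top_by_ratio(records, n=10):
    """Return the n records with the highest request/response size ratio."""
    def ratio(rec):
        return rec["req_size"] / max(rec["resp_size"], 1)  # avoid division by zero
    return sorted(records, key=ratio, reverse=True)[:n]


# Made-up sample data, for illustration only.
SAMPLE = [
    {"source": "37.170.100.5", "dest": "10.0.0.1",
     "time": datetime(2008, 1, 8, 10, 5), "req_size": 500, "resp_size": 50_000},
    {"source": "37.170.100.5", "dest": "100.59.151.133",
     "time": datetime(2008, 1, 8, 14, 0), "req_size": 9_000_000, "resp_size": 2_000},
]
LEAVES = {"37.170.100.5": [datetime(2008, 1, 8, 10, 0)]}
AFTER_LEAVING = uploads_after_leaving(LEAVES, SAMPLE)
BY_RATIO = top_by_ratio(SAMPLE, n=1)
```

The first function only selects candidate records; clustering them by source or destination, as described in the first bullet, would be a separate grouping step.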